---
title: "BAM Phase 1: Preprocessing Documentation"
subtitle: "LNS 2006 Data Preparation for GMM Clustering"
author: "Jessala A. Grijalva"
date: "September 2023"
format:
  pdf:
    toc: true
    toc-depth: 3
    keep-tex: false
execute:
  eval: false
  warning: false
  message: false
---

## Overview

This document describes the preprocessing steps applied to the Latino National Survey
(LNS) 2006 dataset to prepare it for Gaussian Mixture Model (GMM) clustering analysis.
The output of this pipeline is used to derive the Bicultural Acculturation Measure (BAM)
orientations. Replicators should read this file alongside the LNS 2006 codebook located
in `docs/`.

::: {.callout-important}
## Reproducibility Note
This script is provided for **documentation purposes only**. The multiple imputation
step (Section 4) uses MICE with Predictive Mean Matching (PMM), which is stochastic
and produces results that vary across R versions and platforms even with identical seeds.
The canonical preprocessed dataset is preserved as `data/processed/lns_clean.rda` and
**should be used for all downstream analyses**. Do not re-run this script expecting to
reproduce the exact dataset.
:::
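
As a minimal sketch of how a downstream script might read the preserved file, the block below loads an `.rda` file into its own environment so the loaded object cannot silently overwrite anything already in the workspace. The helper `load_canonical()` and the temporary stand-in file are hypothetical and only illustrate the pattern; in this project the path would presumably be `here("data", "processed", "lns_clean.rda")`.

```r
# Hypothetical helper: load an .rda file into a private environment and
# return the first object it contains.
load_canonical <- function(path) {
  stopifnot(file.exists(path))
  env <- new.env()
  obj_names <- load(path, envir = env)  # load() returns the object names
  get(obj_names[1], envir = env)
}

# Stand-in for lns_clean.rda so the sketch runs without the real data
tmp <- tempfile(fileext = ".rda")
lns_subset <- data.frame(AMERICAN = c(4, 3), WEIGHT = c(1.2, 0.8))
save(lns_subset, file = tmp)

lns_clean <- load_canonical(tmp)
str(lns_clean)
```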

## Setup

All packages are loaded via `pacman`, which is installed automatically if it is
missing. `here` is loaded first to enable portable file paths relative to the
project root, so open `bam-scale.Rproj` before running any scripts.

```{r setup}
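# Install pacman on first use so p_load() is available
if (!requireNamespace("pacman", quietly = TRUE)) install.packages("pacman")
library(here)
library(pacman)
p_load(dplyr, tidyr, mice, forcats, skimr, stringr, readr)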

set.seed(500)  # Imputation seed (platform-dependent; see Section 4)
```

## Step 1: Load Raw Data

Raw data is sourced from ICPSR Study 20862 (Latino National Survey, 2006). The file
`20862-0003-Data.rda` should be placed in `data/raw/` and is **not included in this
repository** due to ICPSR data use restrictions. Download directly from:
https://www.icpsr.umich.edu/web/ICPSR/studies/20862

```{r load-data}
load(here("data", "raw", "20862-0003-Data.rda"))
```

## Step 2: Initial Subsetting

The analysis is restricted to US-born, Puerto Rico-born, or naturalized Latino
respondents. This excludes foreign-born non-citizens, who are not the target population
for the BAM. Variables are selected to cover acculturation orientations, political
attitudes, and demographic controls.

```{r subset}
lns_subset <- da20862.0003 %>%
  dplyr::filter(BORNUS %in% c("(1) Mainland US", "(2) Puerto Rico") |
                NATUSCIT == "(1) YES") %>%
  dplyr::select(
    AGE, SEX, REDUC, HHINC, RELIGION, ANCESTRY, BORNUS, NATUSCIT, BIRTHPLC,
    PARBORN, GRANBORN, AMERICAN, RGIDENT, LAIDENT, KEEPSPAN, LEARNENG,
    BLEND, DISTINCT, IMMPOLICY, DREAMACT, IMMVIEW, IDEOLOGY, RACE1,
    FEELPART, PARTYID, PRIMEID, VOTEPRES, NONVPRES, SAYSO, GOVTRUST,
    INCSUPP, HEALTH, WT_NATION_REV
  ) %>%
  dplyr::rename(WEIGHT = WT_NATION_REV)
```
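
The filter combines two conditions with `|`, and `NATUSCIT` may be missing for respondents who were never asked the naturalization question. The base-R sketch below, on toy data, shows how the combined condition behaves with missing values: `TRUE | NA` is `TRUE`, while `FALSE | NA` is `NA`, and `dplyr::filter()` drops rows whose condition is `NA`. The `"(2) NO"` label and the toy rows are assumptions made for illustration only.

```r
# Toy data; "(2) NO" is an assumed label, not taken from the codebook
toy <- data.frame(
  BORNUS = c("(1) Mainland US", "(2) Puerto Rico",
             "(3) Some other country", "(3) Some other country",
             "(3) Some other country"),
  NATUSCIT = c(NA, NA, "(1) YES", "(2) NO", NA),
  stringsAsFactors = FALSE
)

us_born     <- toy$BORNUS %in% c("(1) Mainland US", "(2) Puerto Rico")
naturalized <- toy$NATUSCIT == "(1) YES"   # NA where NATUSCIT is missing
keep        <- us_born | naturalized       # TRUE | NA is TRUE; FALSE | NA is NA

# which() discards NA, mirroring how dplyr::filter() treats NA conditions
toy_kept <- toy[which(keep), ]
toy_kept
```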

## Step 3: Feature Engineering

### Generational Status

Generational status is derived from birthplace and parental birthplace. Three categories
are constructed: First Generation (foreign-born), Second Generation (born in the mainland
US or Puerto Rico with at least one foreign-born parent), and Third Generation Plus (born
in the mainland US or Puerto Rico with both parents US-born). Respondents who cannot be
classified, such as US-born respondents with missing parental birthplace, are dropped.

```{r gen-status}
lns_subset <- lns_subset %>%
  mutate(
    GEN_STATUS = case_when(
      BORNUS == "(3) Some other country" ~ "First Generation",
      BORNUS %in% c("(1) Mainland US", "(2) Puerto Rico") &
        PARBORN != "(2) Both parents born in the U.S." ~ "Second Generation",
      BORNUS %in% c("(1) Mainland US", "(2) Puerto Rico") &
        PARBORN == "(2) Both parents born in the U.S." ~ "Third Generation Plus",
      TRUE ~ "Other"
    )
  ) %>%
  filter(GEN_STATUS != "Other")

lns_subset$GEN_STATUS_num <- as.numeric(case_when(
  lns_subset$GEN_STATUS == "First Generation" ~ 1,
  lns_subset$GEN_STATUS == "Second Generation" ~ 2,
  lns_subset$GEN_STATUS == "Third Generation Plus" ~ 3,
  TRUE ~ NA_real_
))
```
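
The classification rules can be sketched in base R to make the handling of missing values explicit. This is an illustrative re-expression of the logic above, not the code used to build the dataset; the function name `classify_generation()` and the PARBORN label `"(1) Both parents born outside the U.S."` are assumptions.

```r
# Hypothetical base-R version of the generational status rules
classify_generation <- function(bornus, parborn) {
  us_born <- bornus %in% c("(1) Mainland US", "(2) Puerto Rico")
  both_us <- parborn == "(2) Both parents born in the U.S."  # NA if missing

  out <- rep("Other", length(bornus))
  out[bornus %in% "(3) Some other country"] <- "First Generation"
  out[us_born & !is.na(both_us) & !both_us] <- "Second Generation"
  out[us_born & !is.na(both_us) & both_us]  <- "Third Generation Plus"
  out
}

bornus  <- c("(3) Some other country", "(1) Mainland US",
             "(2) Puerto Rico", "(1) Mainland US")
parborn <- c(NA, "(1) Both parents born outside the U.S.",  # assumed label
             "(2) Both parents born in the U.S.", NA)

gen_status <- classify_generation(bornus, parborn)
gen_status
```

In this sketch the last respondent is US-born with missing parental birthplace, so the result is `"Other"` and the row would be removed by the final filter.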

### Clustering Variables (VARS5)

Five variables form the basis for GMM clustering. Each is recoded to a consistent
numeric scale. See the Output Summary table for variable descriptions and ranges.

```{r vars5}
# AMERICAN: Strength of American identity (1=Not at all, 4=Very strongly)
lns_subset$AMERICAN_num <- as.numeric(factor(lns_subset$AMERICAN,
  levels = c("(1) Not at all", "(2) Not very strongly",
             "(3) Somewhat strongly", "(4) Very strongly")))

# LEARNENG: Importance of learning English (1=Not at all, 4=Very important)
lns_subset$LEARNENG_num <- as.numeric(factor(lns_subset$LEARNENG,
  levels = c("(1) Not at all important", "(2) Not very important",
             "(3) Somewhat important", "(4) Very Important")))

# CULTURAL_IDENTITY: Mean of regional identity (RGIDENT) and Latino identity
# (LAIDENT), both on 1-4 scales. Higher = stronger cultural identification.
lns_subset$RGIDENT_num <- as.numeric(factor(lns_subset$RGIDENT,
  levels = c("(1) Not at all", "(2) Not very strongly",
             "(3) Somewhat strongly", "(4) Very strongly")))
lns_subset$LAIDENT_num <- as.numeric(factor(lns_subset$LAIDENT,
  levels = c("(1) Not at all", "(2) Not very strongly",
             "(3) Somewhat strongly", "(4) Very strongly")))
lns_subset$CULTURAL_IDENTITY <- rowMeans(
  cbind(lns_subset$RGIDENT_num, lns_subset$LAIDENT_num), na.rm = TRUE)

# DISTINCT: Importance of maintaining a distinct Latino culture (1-3)
lns_subset$DISTINCT_num <- as.numeric(lns_subset$DISTINCT)

# KEEPSPAN: Importance of keeping Spanish (1-4)
lns_subset$KEEPSPAN_num <- as.numeric(lns_subset$KEEPSPAN)
```
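
Two details of these recodes are easy to miss. `as.numeric()` on a factor returns the integer level codes, so the order given in `levels` determines the numeric values. And `rowMeans(..., na.rm = TRUE)` returns `NaN`, not `NA`, for a row in which both inputs are missing. The toy example below illustrates both points; the final line, which converts `NaN` to `NA`, is an optional step that is not part of the original pipeline.

```r
levels_4pt <- c("(1) Not at all", "(2) Not very strongly",
                "(3) Somewhat strongly", "(4) Very strongly")

# Toy responses, including partially and fully missing rows
rgident <- c("(4) Very strongly", "(2) Not very strongly", NA, NA)
laident <- c("(3) Somewhat strongly", NA, "(1) Not at all", NA)

rg_num <- as.numeric(factor(rgident, levels = levels_4pt))
la_num <- as.numeric(factor(laident, levels = levels_4pt))

# Row 2 and row 3 use the single available item; row 4 has none
cultural_identity <- rowMeans(cbind(rg_num, la_num), na.rm = TRUE)
any_nan <- any(is.nan(cultural_identity))

# Optional (not in the original pipeline): make the missingness explicit
cultural_identity[is.nan(cultural_identity)] <- NA_real_
cultural_identity
```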

### Political and Outcome Variables

Political variables are recoded for interpretability. Direction of coding is noted
inline. These variables are not used in clustering but serve as outcome measures in
downstream analyses.

```{r political-vars}
# IDEOLOGY: Recoded so higher = more liberal
# 1 = Conservative, 2 = Moderate / Don't know, 3 = Liberal
# Note: the catch-all TRUE ~ 2 also assigns 2 to missing responses.
# Stored as IDEOLOGY_num because Step 5 selects the _num columns.
lns_subset <- lns_subset %>%
  mutate(IDEOLOGY_num = case_when(
    IDEOLOGY == "(1) Conservative" ~ 1,
    IDEOLOGY == "(2) Liberal" ~ 3,
    TRUE ~ 2
  ))

# PARTYID: Recoded so higher = more Democratic
# 1 = Republican, 2 = Independent / Other, 3 = Democrat
# Note: the catch-all TRUE ~ 2 also assigns 2 to missing responses.
lns_subset <- lns_subset %>%
  mutate(PARTYID_num = case_when(
    PARTYID == "(2) Republican" ~ 1,
    PARTYID == "(1) Democrat" ~ 3,
    TRUE ~ 2
  ))

# IMMVIEW: Reverse coded so higher = more pro-immigrant
# 1 = immigrants are a burden, 2 = immigrants strengthen the country
lns_subset <- lns_subset %>%
  mutate(IMMVIEW_num = case_when(
    grepl("strengthen", IMMVIEW) ~ 2,
    grepl("burden", IMMVIEW) ~ 1,
    TRUE ~ NA_real_
  ))

# DREAMACT: Recoded to a 1-4 numeric scale via the level order below.
# As written, 1 = Strongly Support and 4 = Strongly Oppose, so higher values
# indicate stronger opposition (the reverse of the original LNS label numbering).
lns_subset$DREAMACT_num <- as.numeric(factor(lns_subset$DREAMACT,
  levels = c("(4) Strongly Support", "(3) Support",
             "(2) Oppose", "(1) Strongly Oppose")))

# IMMPOLICY: 5-point scale from most restrictive (1) to most permissive (5)
lns_subset <- lns_subset %>%
  mutate(IMMPOLICY_num = case_when(
    grepl("Immediate legalization", IMMPOLICY) ~ 5,
    grepl("guest worker.*legalization eventually", IMMPOLICY) ~ 4,
    grepl("None of these", IMMPOLICY) ~ 3,
    grepl("guest worker.*permits", IMMPOLICY) ~ 2,
    grepl("seal or close", IMMPOLICY) ~ 1,
    TRUE ~ NA_real_
  ))
```
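
A cross-tabulation of raw against recoded values is a quick way to confirm the direction of each recode and to see where the catch-all category lands. The base-R sketch below mirrors the three-way IDEOLOGY recode on toy data. The helper `recode_ideology()` and the label `"(3) Middle of the road"` are hypothetical; the point is only that a catch-all branch assigns the middle value to missing responses as well.

```r
# Hypothetical base-R mirror of the three-way recode with a catch-all
recode_ideology <- function(x) {
  out <- rep(2, length(x))              # catch-all, including NA
  out[x %in% "(1) Conservative"] <- 1
  out[x %in% "(2) Liberal"]      <- 3
  out
}

# "(3) Middle of the road" is an assumed label for illustration
ideology_raw <- c("(1) Conservative", "(2) Liberal",
                  "(3) Middle of the road", NA)
ideology_num <- recode_ideology(ideology_raw)

# useNA = "ifany" shows which recoded value the missing responses received
table(raw = ideology_raw, recoded = ideology_num, useNA = "ifany")
```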

## Step 4: Multiple Imputation (MICE)

Missing values on selected clustering, outcome, and demographic variables (listed in
`vars_to_impute`) are imputed using Multivariate Imputation by Chained Equations (MICE)
with Predictive Mean Matching (PMM). Five imputed datasets are generated; only the first
is used for all subsequent analyses, so estimates are not pooled across imputations.
MICE with PMM is a widely used approach for handling item nonresponse in survey data.

::: {.callout-warning}
## Stochastic Process
PMM involves random sampling from donor pools. Results will differ across R versions
and platforms even with an identical seed. **Use `data/processed/lns_clean.rda`** for
exact replication of published results.
:::

```{r mice}
vars_to_impute <- c("CULTURAL_IDENTITY", "AMERICAN_num", "DISTINCT_num", "BLEND_num",
                    "VOTE_PREF_num", "SAYSO_num", "INCSUPP_num", "HEALTH_num",
                    "DREAMACT_num", "AGE", "RACE1", "HHINC")

imputed_data <- mice(
  lns_subset[, vars_to_impute],
  m = 5,           # Number of imputed datasets
  maxit = 20,      # Number of iterations of the chained-equations algorithm
  method = "pmm",  # Predictive Mean Matching
  seed = 500
)

# Use first imputed dataset for downstream analyses
single_completed_data <- complete(imputed_data, 1)
lns_subset[vars_to_impute] <- single_completed_data[vars_to_impute]
```
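
To make the source of randomness concrete, the sketch below implements a deliberately simplified version of predictive mean matching for a single variable in base R. It is an illustration of the idea only and is not the algorithm `mice` runs: among other differences, `mice` draws regression parameters rather than using a single fitted model. All names and the simulated data are assumptions.

```r
set.seed(1)

# Simulated data: y depends on x, and 10 of 50 values of y are missing
n <- 50
x <- rnorm(n)
y <- round(2 + 1.5 * x + rnorm(n))
y[sample(n, 10)] <- NA
obs <- !is.na(y)

# 1. Fit a model on observed cases and predict for every case
fit  <- lm(y ~ x, subset = obs)
pred <- predict(fit, newdata = data.frame(x = x))

# 2. For each missing case, find the 5 observed cases with the closest
#    predicted value and copy the observed y of one randomly chosen donor
y_imp <- y
for (i in which(!obs)) {
  gap    <- abs(pred[obs] - pred[i])
  donors <- order(gap)[1:5]
  y_imp[i] <- y[obs][donors[sample.int(5, 1)]]
}

summary(y_imp)
```

Because every imputed value is copied from an observed donor, imputed values always fall within the observed range. The random donor draw in step 2 is what makes results depend on the random number stream.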

## Step 5: Final Cleaning

Variables are subset to the final analytical set and the `_num` suffix is removed from
recoded variables for cleaner output in downstream analyses.

```{r final-clean}
lns_subset <- lns_subset %>%
  dplyr::select(
    CULTURAL_IDENTITY, AMERICAN_num, LEARNENG_num, DISTINCT_num,
    KEEPSPAN_num, BLEND_num, VOTE_PREF_num, IDEOLOGY_num, FEELPART_num,
    PARTYID_num, SAYSO_num, GOVTRUST_num, INCSUPP_num, HEALTH_num,
    DREAMACT_num, IMMVIEW_num, IMMPOLICY_num,
    EDUCATION, NATIONALITY, BIRTH_ORIGIN, GEN_STATUS_num, AGE, SEX,
    RACE1, HHINC, RELIGION, WEIGHT
  )

# Remove the trailing _num suffix for cleaner variable names in output
names(lns_subset) <- sub("_num$", "", names(lns_subset))
```
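
When stripping a suffix with a regular expression, anchoring the pattern to the end of the string (`_num$`) avoids removing `_num` where it appears elsewhere in a name. The toy example below compares an unanchored and an anchored pattern; the name `HH_numbers_num` is hypothetical and chosen only to show the difference.

```r
# "HH_numbers_num" is a made-up name that contains "_num" twice
var_names <- c("AMERICAN_num", "GEN_STATUS_num", "AGE", "HH_numbers_num")

unanchored <- gsub("_num", "", var_names)   # removes every occurrence
anchored   <- sub("_num$", "", var_names)   # removes only a trailing suffix

data.frame(var_names, unanchored, anchored)
```

For the variable names selected in this step the two patterns give the same result, since `_num` appears only as a suffix.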

## Step 6: Save Preprocessed Data

The cleaned dataset is saved to `data/processed/`. This file is the canonical input
for all Phase 2 and downstream analyses.

```{r save}
save(lns_subset, file = here("data", "processed", "lns_clean.rda"))
```

## Output Summary

| Attribute | Value |
|-----------|-------|
| Observations | 4,785 |
| Variables | 27 |
| Missing values | 0 (after imputation) |

### Clustering Variables (VARS5)

| Variable | Description | Scale | Direction |
|----------|-------------|-------|-----------|
| AMERICAN | Strength of American identity | 1–4 | Higher = stronger |
| CULTURAL_IDENTITY | Mean of regional + Latino identity | 1–4 | Higher = stronger |
| KEEPSPAN | Importance of keeping Spanish | 1–4 | Higher = more important |
| DISTINCT | Importance of maintaining distinct culture | 1–3 | Higher = more important |
| LEARNENG | Importance of learning English | 1–4 | Higher = more important |
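
The documented ranges can double as a lightweight validation check after loading the data. The sketch below tests each clustering variable against the scale listed in the table; `check_ranges()` and the toy data frame are hypothetical, and the last toy column deliberately contains an out-of-range value.

```r
# Scales taken from the table above
expected_ranges <- list(
  AMERICAN          = c(1, 4),
  CULTURAL_IDENTITY = c(1, 4),
  KEEPSPAN          = c(1, 4),
  DISTINCT          = c(1, 3),
  LEARNENG          = c(1, 4)
)

# Hypothetical helper: TRUE if a variable is complete and within its range
check_ranges <- function(df, ranges) {
  vapply(names(ranges), function(v) {
    x <- df[[v]]
    !anyNA(x) && all(x >= ranges[[v]][1] & x <= ranges[[v]][2])
  }, logical(1))
}

# Toy data; LEARNENG contains a 5 to trigger a failure
toy <- data.frame(
  AMERICAN = c(1, 4), CULTURAL_IDENTITY = c(2.5, 4), KEEPSPAN = c(3, 4),
  DISTINCT = c(1, 3), LEARNENG = c(4, 5)
)

range_ok <- check_ranges(toy, expected_ranges)
range_ok
```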

---

## Session Info (Original Analysis)

The original preprocessing was conducted in **September 2023** with:

- R version ~4.1.x
- `mice` package version current at that time
- Platform: macOS

*For exact replication, use the preserved `data/processed/lns_clean.rda` file rather
than re-running this script.*
